Empowering Responsible Use of Large Language Models
The rapid advancement of powerful Large Language Models (LLMs), such as ChatGPT and Llama, has revolutionized the world by opening new creative possibilities and enhancing productivity. However, these advancements also pose significant challenges and risks, including the potential for misuse in the form of fake news, academic dishonesty, intellectual property infringement, and privacy leaks. In response to these concerns, this thesis explores approaches to promoting the responsible use of LLMs from both theoretical and empirical perspectives. Three key approaches are presented: (1) Detecting AI-generated Text via Watermarking: We propose a robust and high-quality watermarking method called Unigram-Watermark and introduce a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. Furthermore, we propose PF-Watermark, which achieves the best balance of high detection accuracy and low perplexity. (2) Protecting the Intellectual Property of LLMs: We safeguard the intellectual property of LLMs through novel watermarking techniques designed to prevent model-stealing attacks in both text classification and text generation tasks. (3) Privacy-Preserving LLMs: We employ Confidential Redacted Training (CRT) to train and fine-tune language generation models while protecting sensitive information. In summary, we propose a suite of algorithms and solutions to address the pressing safety, security, and privacy concerns surrounding LLMs. We hope our studies provide valuable insights for researchers exploring future directions that promote responsible AI development and deployment.
Provable Robust Watermarking for AI-Generated Text
We study the problem of watermarking text generated by large language models (LLMs) -- one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermarking method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our method enjoys guaranteed generation quality and correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark
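The detection side of a fixed-green-list ("unigram") watermark reduces to a one-proportion z-test on how many tokens of a text fall in a key-dependent "green" subset of the vocabulary. The sketch below is a minimal illustration of that idea, not the paper's implementation; the keyed-hash partition and all names and parameters are hypothetical.

```python
import hashlib
import math

def green_list(vocab_size, key, fraction=0.5):
    # Partition the vocabulary into a fixed, key-dependent "green" set
    # (hypothetical keyed-hash scheme, for illustration only).
    return {t for t in range(vocab_size)
            if hashlib.sha256(f"{key}:{t}".encode()).digest()[0] < 256 * fraction}

def z_score(token_ids, green, fraction=0.5):
    # One-proportion z-test: watermarked text over-represents green tokens,
    # so a large z suggests the watermark is present.
    n = len(token_ids)
    hits = sum(1 for t in token_ids if t in green)
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

On the generation side, the sampler would upweight green tokens; detection needs only the secret key and the text, not the model.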
Generative Autoencoders as Watermark Attackers: Analyses of Vulnerabilities and Threats
Invisible watermarks safeguard images' copyrights by embedding hidden messages detectable by their owners. They also deter misuse of images, especially those generated by AI models. Malicious adversaries can violate these rights by removing the watermarks. To remove watermarks without damaging the visual quality, the adversary must erase them while retaining the essential information in the image. This is analogous to the encoding and decoding process of generative autoencoders, especially variational autoencoders (VAEs) and diffusion models. We propose a framework that uses generative autoencoders to remove invisible watermarks and test it with VAEs and diffusion models. Our results reveal that, even without task-specific training, off-the-shelf Stable Diffusion effectively removes most watermarks, surpassing all current attackers. This result underscores the vulnerabilities of existing watermarking schemes and calls for more robust methods for copyright protection.
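The attack's core intuition is that a lossy encode-decode pass preserves an image's essential content while destroying a faint watermark. The toy sketch below substitutes a truncated-SVD reconstruction for a real generative autoencoder and a simple correlation detector for a real watermarking scheme; everything here is illustrative, not any deployed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": smooth low-rank content plus a faint sign-pattern watermark
# (not a real image format or a real watermarking scheme).
content = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
pattern = rng.choice([-1.0, 1.0], size=(64, 64))
image = content + 0.5 * pattern

def autoencode(x, rank=8):
    # Crude proxy for a generative autoencoder: keep only the dominant
    # structure of x, discarding the faint high-frequency watermark.
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def detect(x):
    # Owner-side detector: correlation with the known watermark pattern,
    # normalized so an intact watermark scores about 1.
    return float(np.mean(x * pattern) / 0.5)

attacked = autoencode(image)
```

The reconstruction stays close to the content while the detector's score collapses, mirroring the paper's observation that generic encode-decode models act as watermark removers.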
An open framework for semantic code queries on heterogeneous repositories
To help developers understand and reuse programs, semantic queries on the source code itself are attractive. Although programs in heterogeneous languages are maintained together for collaborative software development, most queries supported by various source code repositories are based either on the metadata of the repositories or on indexed identifiers and method signatures. Few fully support searching for structures that are common across different programming languages and different viewpoints (hence heterogeneous). To facilitate understanding and reuse, in this paper we propose a novel source code query framework that (1) transforms source code to a unified abstract syntax format and handles heterogeneity (non-isomorphism) at the abstract syntax level; (2) stores source code in cloud-based NoSQL storage in MongoDB; and (3) rewrites semantic query patterns into the NoSQL form. The efficiency of the framework has been evaluated on several open-source hosting platforms.
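A minimal sketch of the unified-AST idea, using Python's built-in ast module in place of a cross-language abstract syntax format, and a plain list of dictionaries in place of the MongoDB collection. The field names and the query helper are hypothetical, not the paper's framework.

```python
import ast

def index_functions(source):
    # Parse source into an abstract syntax tree and record, per function,
    # a few structural facts that could be stored as documents in a
    # NoSQL collection (MongoDB in the paper's framework).
    docs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            docs.append({
                "name": node.name,
                "n_args": len(node.args.args),
                "has_loop": any(isinstance(n, (ast.For, ast.While))
                                for n in ast.walk(node)),
            })
    return docs

def query(docs, **pattern):
    # Trivial stand-in for rewriting a semantic pattern into a document
    # query: match on the extracted structural fields.
    return [d for d in docs if all(d.get(k) == v for k, v in pattern.items())]
```

Because the query runs against extracted structure rather than raw text, the same pattern (e.g. "functions containing a loop") could in principle be answered uniformly across languages once each language's parser emits the shared format.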
"Private Prediction Strikes Back!" Private Kernelized Nearest Neighbors with Individual Rényi Filter
Most existing approaches to differentially private (DP) machine learning focus on private training. Despite its many advantages, private training lacks the flexibility to adapt to incremental changes in the training dataset, such as deletion requests arising from the GDPR's right to be forgotten. We revisit a long-forgotten alternative, known as private prediction, and propose a new algorithm named Individual Kernelized Nearest Neighbor (Ind-KNN). Ind-KNN is easily updatable over dataset changes, and it allows precise control of Rényi DP at an individual user level -- a user's privacy loss is measured by the exact amount of her contribution to predictions, and a user is removed once her prescribed privacy budget runs out. Our results show that Ind-KNN consistently improves accuracy over existing private prediction methods for a wide range of privacy budgets on four vision and language tasks. We also illustrate several cases in which Ind-KNN is preferable to private training with NoisySGD.
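A heavily simplified sketch of the per-user budgeting idea: each user holds a privacy budget, a prediction charges a fixed cost only to users whose points contribute to the noisy vote, and exhausted users stop contributing. This is illustrative only; it is not the paper's algorithm, kernel, or Rényi-DP accounting, and all names and parameters are hypothetical.

```python
import random

class IndKNNSketch:
    # Illustrative sketch of per-user budgeted private prediction on
    # 1-D toy points (x, label, user_id); not the paper's method.
    def __init__(self, points, budget=1.0, cost=0.1, sigma=0.3, k=3):
        self.points = points
        self.budget = {u: budget for _, _, u in points}
        self.cost, self.sigma, self.k = cost, sigma, k

    def predict(self, x, n_classes=2):
        # Only users with enough remaining budget may contribute.
        live = [(abs(px - x), lbl, u) for px, lbl, u in self.points
                if self.budget.get(u, 0.0) >= self.cost]
        live.sort(key=lambda t: t[0])
        votes = [0.0] * n_classes
        for _, lbl, u in live[: self.k]:
            votes[lbl] += 1.0
            self.budget[u] -= self.cost  # charge only contributing users
        # Noisy argmax over the class votes.
        noisy = [v + random.gauss(0.0, self.sigma) for v in votes]
        return max(range(n_classes), key=lambda c: noisy[c])
```

Because privacy loss is charged per contribution, a deletion request can be honored by simply dropping that user's points, without retraining anything -- the flexibility the abstract contrasts with private training.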
Normal T1 relaxometry and extracellular volume of the pancreas in subjects with no pancreas disease: correlation with age and gender
OBJECTIVE
To determine the normal T1 and extracellular volume (ECV) of the pancreas in subjects with no pancreas disease, and to correlate them with age and gender.
SUBJECTS AND METHODS
We imaged 120 healthy subjects (age range: 20-78 years) who were on annual screening with MRI/MRCP for the possibility of pancreatic cancer. Subjects had a predisposition to develop pancreatic cancer, but no history of pancreas disease or acute symptoms. Equal numbers of subjects (n = 60 each) were scanned on either a 1.5 T or a 3 T scanner using a dual flip angle spoiled gradient echo technique incorporating fat suppression and correction for B1 field inhomogeneity. Optimization of the imaging parameters was performed using a T1 phantom. ECV was calculated using pre- and post-contrast T1 of the pancreas and plasma. Regression analysis and Mann-Whitney tests were used for statistical analysis.
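For reference, extracellular volume is conventionally derived from the pre- and post-contrast relaxation rates R1 = 1/T1 of tissue and a blood reference, scaled by (1 - hematocrit); when plasma T1 is measured directly, as in this study, the hematocrit factor can be omitted. The sketch below encodes that standard formula; the numbers in the usage note are illustrative, not values from this study.

```python
def ecv(t1_pre_tissue, t1_post_tissue, t1_pre_ref, t1_post_ref, hct=None):
    # Conventional extracellular-volume estimate from pre/post-contrast
    # T1 (ms): ECV = (1 - Hct) * dR1_tissue / dR1_blood, with R1 = 1/T1.
    # Pass hct=None when the reference is plasma T1 measured directly,
    # in which case the hematocrit correction is omitted.
    d_r1_tissue = 1.0 / t1_post_tissue - 1.0 / t1_pre_tissue
    d_r1_ref = 1.0 / t1_post_ref - 1.0 / t1_pre_ref
    scale = 1.0 if hct is None else (1.0 - hct)
    return scale * d_r1_tissue / d_r1_ref
```

For example, with an illustrative pre-contrast pancreatic T1 of 654 ms shortening to 400 ms, and blood T1 of 1600 ms shortening to 300 ms at a hematocrit of 0.42, the formula yields an ECV of roughly 0.21, in the range the study reports.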
RESULTS
Median T1 on 1.5 T was 654 ms (IQR: 608-700); median T1 on 3 T was 717 ms (IQR: 582-850); median ECV on 1.5 T was 0.28 (IQR: 0.21-0.33) and median ECV on 3 T was 0.25 (IQR: 0.19-0.28). Age had a mild positive correlation with T1 (r= 0.24, p= 0.009), but not with ECV (r= 0.06, p=0.54). T1 and ECV were similar in both genders (p >0.05).
CONCLUSION
This study measured the median T1 and ECV of the pancreas in subjects with no pancreas disease. The pancreas shows longer T1 relaxation times in the older population, whereas the extracellular fraction remains unchanged. Median T1 values differed between the two magnet strengths; however, neither T1 nor ECV differed between genders.
Positive loop-closed automata: a decidable class of hybrid systems
The model-checking problem for real-time and hybrid systems is very difficult; even for a well-formed class of hybrid systems, the linear hybrid automata, the problem is undecidable in general. An important question for the analysis and design of real-time and hybrid systems is therefore the identification of subclasses of such systems, and corresponding restricted classes of analysis problems, that can be settled algorithmically. In this paper, we show that for a class of linear hybrid automata called positive loop-closed automata, the satisfaction problem for linear duration properties can be solved by linear programming. We extend traditional regular expressions with duration constraints and use them as a language to describe the behaviour of this class of linear hybrid automata; the extended notation is called duration-constrained regular expressions. Based on this formalism, we show that the model-checking problem can be reduced formally to linear programs.
Predicting Alzheimer's Disease by Hierarchical Graph Convolution from Positron Emission Tomography Imaging
Imaging-based early diagnosis of Alzheimer's Disease (AD) has become an effective approach, especially using nuclear medicine imaging techniques such as Positron Emission Tomography (PET). The literature has found that PET images can be better modeled as signals (e.g., uptake of florbetapir) defined on a network (non-Euclidean) structure governed by underlying graph patterns of pathological progression and metabolic connectivity. To effectively apply a deep learning framework to PET image analysis and overcome its limitation to Euclidean grids, we develop a solution for 3D PET image representation and analysis under a generalized, graph-based CNN architecture (PETNet), which analyzes PET signals defined on a group-wise inferred graph structure. Computations in PETNet are defined in the non-Euclidean graph (network) domain: it performs feature extraction by convolution operations on spectrally filtered signals on the graph, and pooling operations based on hierarchical graph clustering. The effectiveness of PETNet is evaluated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, showing improved performance over both deep learning and other machine learning-based methods.
Comment: Jiaming Guo, Wei Qiu and Xiang Li contributed equally to this work.